Expectile Matrix Factorization for Skewed Data Analysis
Abstract
Matrix factorization is a popular approach to solving matrix estimation problems based on partial observations. Existing matrix factorization methods are based on least squares and aim to yield a low-rank matrix that interprets the conditional sample means given the observations. However, in many real applications with skewed and extreme data, least squares cannot explain their central tendency or tail distributions, yielding undesired estimates. In this paper, we propose expectile matrix factorization by introducing asymmetric least squares, a key concept in expectile regression analysis, into the matrix factorization framework. We propose an efficient algorithm to solve the new problem based on alternating minimization and quadratic programming. We prove that our algorithm converges to a global optimum and exactly recovers the true underlying low-rank matrices when the noise is zero. On synthetic data with skewed noise and on a real-world dataset containing web service response times, the proposed scheme achieves lower recovery errors than the existing matrix factorization method based on least squares in a wide range of settings.

Introduction

Matrix estimation has wide applications in many fields such as recommendation systems (Koren, Bell, and Volinsky 2009), network latency estimation (Liao et al. 2013), computer vision (Chen and Suter 2004), and system identification (Liu and Vandenberghe 2009). In these problems, a low-rank matrix M* ∈ R^{m×n}, or a linear mapping A(M*) of the low-rank matrix M*, where A : R^{m×n} → R^p, is assumed to underlie some possibly noisy observations. The objective is to recover the underlying low-rank matrix from the partial observations b_i, i = 1, …, p. For example, a movie recommendation system aims to recover all user-movie preferences based on the ratings between some user-movie pairs (Koren, Bell, and Volinsky 2009; Su and Khoshgoftaar 2009), or based on implicit feedback, e.g., watching times/frequencies, that are logged for some users on some movies (Hu, Koren, and Volinsky 2008; Rendle et al. 2009). In network or web service latency estimation (Liao et al. 2013; Liu et al. 2015; Zheng, Zhang, and Lyu 2014), given partially collected latency measurements between some nodes that are possibly contaminated by noise, the goal is to recover the underlying low-rank latency matrix, whose low rank arises from correlations among network paths and functions.

Matrix factorization is a popular approach to low-rank matrix estimation, in which the underlying matrix M* ∈ R^{m×n} is assumed to be M* = XY^⊤, with X ∈ R^{m×k} and Y ∈ R^{n×k}, so that the rank of M* is enforced to be at most k. The goal is to find an M̂ that minimizes the aggregate loss of the estimates A(M̂) on all observed samples b_i, i = 1, …, p. Matrix factorization problems, although nonconvex, can be solved efficiently at large scale by several standard optimization methods such as alternating minimization and stochastic gradient descent. As a result, matrix factorization has gained enormous success in real-world recommender systems, e.g., the Netflix Prize competition (Koren, Bell, and Volinsky 2009), and in large-scale network latency estimation, e.g., DMFSGD (Liao et al. 2013), due to its scalability, low computational cost per iteration, and ease of distributed implementation.
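As a quick numeric illustration of the skewness issue motivating the paper (hypothetical latency-style numbers, not the paper's dataset), a heavy right tail pulls the least-squares optimum, i.e., the sample mean, far away from where most observations actually lie:

```python
import numpy as np

# Skewed "web latency"-style sample: most values near 100 ms, plus a few
# multi-second outliers (hypothetical numbers for illustration only).
rng = np.random.default_rng(0)
b = np.concatenate([rng.normal(100, 10, 950), rng.normal(5000, 500, 50)])

# Least squares fits the conditional mean, which the outliers drag upward:
print(b.mean())        # ~345 ms, far above the typical ~100 ms observation
print(np.median(b))    # ~100 ms, the central tendency of the bulk
```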
In contrast, another approach to matrix estimation and completion, namely nuclear-norm minimization (Candès and Tao 2010; Candès and Plan 2010) based on singular value thresholding (SVT) (Cai, Candès, and Shen 2010) or proximal gradient methods (Ma, Goldfarb, and Chen 2011), is relatively less scalable to problems of huge size due to its high computational cost per iteration (Sun and Luo 2015). Recently, a few studies (Sun and Luo 2015; Jain, Netrapalli, and Sanghavi 2013; Zhao, Wang, and Liu 2015) have also theoretically shown that many optimization algorithms converge to the global optimum of the matrix factorization formulation and can recover the underlying true low-rank matrix under certain conditions.

Nevertheless, a common limitation of almost all existing studies on matrix estimation is that they ignore the fact that observations in practice can be highly skewed and, in many applications, do not follow symmetric normal distributions. For example, latencies to web services over the Internet are highly skewed, in that most measurements are within hundreds of milliseconds while a small portion of outliers can be over several seconds due to network congestion or temporary service unavailability (Zheng, Zhang, and Lyu 2014; Liu et al. 2015). In a video recommender system based on implicit feedback (e.g., user viewing history), the watching time is also highly skewed, in the sense that a user may watch most videos for a short period of time and only finish the few videos that he or she truly likes (Hu, Koren, and Volinsky 2008). In other words, the majority of existing matrix factorization methods are based on least squares and attempt to produce a low-rank matrix M̂ such that A(M̂) estimates the conditional means of the observations. In the presence of extreme and skewed data, however, this may incur large biases and fail to fulfill practical requirements. For example, in web service latency estimation, we want to find the most probable latency between each client-service pair instead of its conditional mean, which is biased towards large outliers. Alternatively, one may be interested in finding the tail latencies and excluding services with long latency tails from being recommended to a client. Similarly, in recommender systems based on implicit feedback, predicting the conditional mean watching time of each user on a video is meaningless due to the skewness of watching times. Instead, we may want to find the most likely time length that the user would spend on the video, and base the recommendation on that. For the asymmetric, skewed, and heavy-tailed data that are prevalent in the real world, new matrix factorization techniques need to be developed beyond symmetric least squares, in order to achieve robustness to outliers and to better interpret the central tendency or dispersion of observations.

In this paper, we propose the concept of expectile matrix factorization (EMF) by replacing the symmetric least squares loss function in conventional matrix factorization with a loss function similar to those used in expectile regression (Newey and Powell 1987). Our scheme is different from weighted matrix factorization (Singh and Gordon 2008), in that we not only assign different weights to different residuals, but also make each weight conditional on whether the residual is positive or negative.
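To make the asymmetric least squares idea concrete, here is a minimal self-contained sketch (our illustration, not the paper's code) of the expectile-regression loss ρ_ω(r) = |ω − 1{r<0}|·r² from Newey and Powell (1987), together with the ω-expectile of a skewed sample computed by the standard weighted-mean fixed-point iteration; at ω = 0.5 the loss is symmetric and the minimizer is exactly the sample mean:

```python
import numpy as np

def expectile_loss(r, w):
    """Asymmetric least squares: rho_w(r) = |w - 1{r < 0}| * r^2."""
    return np.where(r < 0, 1.0 - w, w) * r**2

def expectile(b, w, iters=100):
    """w-expectile of b: argmin_t sum_i rho_w(b_i - t), computed by a
    weighted-mean fixed-point iteration with sign-dependent weights."""
    t = b.mean()                              # start at the 0.5-expectile
    for _ in range(iters):
        s = np.where(b < t, 1.0 - w, w)       # weights flip with residual sign
        t = np.sum(s * b) / np.sum(s)
    return t

# The asymmetry: at w = 0.9, negative residuals are penalized much less.
print(expectile_loss(np.array([-1.0, 1.0]), 0.9))   # [0.1, 0.9]

# Skewed, latency-like sample: a bulk near 100 and a heavy right tail.
rng = np.random.default_rng(0)
b = np.concatenate([rng.normal(100, 10, 950), rng.normal(5000, 500, 50)])

print(expectile(b, 0.5))   # ~345: the mean, dragged upward by the tail
print(expectile(b, 0.1))   # ~130: tracks the central tendency of the bulk
```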
Intuitively speaking, our expectile matrix factorization problem aims to produce a low-rank matrix M̂ such that A(M̂) can estimate any ωth conditional expectile of the observations, not only enhancing robustness to outliers, but also offering a more sophisticated statistical understanding of the observations in a matrix beyond mean statistics.

We make multiple contributions in this paper. First, we propose an efficient algorithm based on alternating minimization and quadratic programming to solve expectile matrix factorization, with low complexity similar to that of alternating least squares in conventional matrix factorization. Second, we theoretically prove that, under certain conditions, expectile matrix factorization retains the desirable properties that, in the absence of noise, it achieves global optimality and exactly recovers the true underlying low-rank matrices. This result generalizes the prior result (Zhao, Wang, and Liu 2015) on the optimality of alternating minimization for matrix estimation under the symmetric least squares loss (corresponding to ω = 0.5 in EMF) to a general class of "asymmetric least squares" loss functions for any ω ∈ (0, 1). The results are obtained by adapting a powerful tool we have developed on the theoretical properties of weighted matrix factorization with weights varying across iterations. Third, for data generated from a low-rank matrix contaminated by skewed noise, we show that our scheme achieves a better approximation to the original low-rank matrix than conventional matrix factorization based on least squares. Finally, we performed extensive evaluation on a real-world dataset containing web service response times between 339 clients and 5825 web services distributed worldwide, and show that the proposed EMF saliently outperforms the state-of-the-art matrix factorization scheme based on least squares in terms of web service latency recovery from only 5-10% of samples.

Notation: Unless otherwise specified, any vector v = (v_1, …, v_p)^⊤ ∈ R^p is a column vector, and we denote its ℓ_p norm by ‖v‖_p = (∑_j |v_j|^p)^{1/p}. For a matrix A ∈ R^{m×n}, we denote by A_{ij} its (i, j)-entry, and by σ_1(A) ≥ σ_2(A) ≥ … ≥ σ_k(A) its singular values, where k = rank(A). Sometimes we also write σ_max(A) for its maximum singular value and σ_min(A) for its minimum singular value. We denote by ‖A‖_F = √(∑_j σ_j(A)²) its Frobenius norm and by ‖A‖_2 = σ_max(A) its spectral norm. For any two matrices A, B ∈ R^{m×n}, we denote their inner product by ⟨A, B⟩ = tr(A^⊤B) = ∑_{i,j} A_{ij}B_{ij}. For a bivariate function f(x, y), we denote the partial gradient with respect to x by ∇_x f(x, y) and that with respect to y by ∇_y f(x, y).

Expectile Matrix Factorization

Given a linear mapping A : R^{m×n} → R^p, we can obtain p observations of an m×n matrix M* ∈ R^{m×n}. In particular, we can decompose the linear mapping A into p inner products ⟨A_i, M⟩ for i = 1, …, p, with A_i ∈ R^{m×n}. Denote the p observations by a column vector b = (b_1, …, b_p)^⊤ ∈ R^p, where b_i is the observation of ⟨A_i, M*⟩ and may contain independent random noise. The matrix estimation problem is to recover the underlying true matrix M* from the observations b, assuming that M* has low rank. Matrix factorization assumes that the matrix M* has rank no more than k and can be factorized into two tall matrices X ∈ R^{m×k} and Y ∈ R^{n×k}, with k ≪ min{m, n, p}. Specifically, it estimates M* by solving the following nonconvex optimization problem:

  min_{X∈R^{m×k}, Y∈R^{n×k}} ∑_{i=1}^{p} L(b_i, ⟨A_i, M⟩)   s.t. M = XY^⊤,

where L(·, ·) is a loss function.
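As a concrete instance of this setup, consider matrix completion, where each sensing matrix A_i = e_{u_i} e_{v_i}^⊤ selects a single entry, so that ⟨A_i, M⟩ = M_{u_i v_i}. The sketch below (illustrative sizes and variable names, not from the paper) generates such observations from a synthetic rank-k matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k, p = 30, 40, 3, 600           # illustrative problem sizes

# True rank-k matrix M* = X* Y*^T, with tall factors X* and Y*.
X_star = rng.normal(size=(m, k))
Y_star = rng.normal(size=(n, k))
M_star = X_star @ Y_star.T

# Matrix completion: A_i = e_{u_i} e_{v_i}^T, so <A_i, M*> = M*[u_i, v_i].
rows = rng.integers(0, m, p)          # u_1, ..., u_p
cols = rng.integers(0, n, p)          # v_1, ..., v_p
b = M_star[rows, cols]                # noise-free observations b_i
```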
We denote the optimal solution to the above problem by M̂. The most common loss function used in matrix factorization is the squared loss (b_i − ⟨A_i, XY^⊤⟩)², with which the problem is to minimize the mean squared error (MSE):

  min_{X∈R^{m×k}, Y∈R^{n×k}} ∑_{i=1}^{p} (b_i − ⟨A_i, XY^⊤⟩)².
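The following is a hedged sketch of alternating minimization for expectile matrix factorization on the matrix-completion instance above. With one factor fixed, the subproblem in the other factor is convex; the paper solves it by quadratic programming, while this sketch substitutes iteratively reweighted least squares for the same subproblem. All function and parameter names (emf_als, outer, inner, lam) are our own illustrative choices:

```python
import numpy as np

def emf_als(b, rows, cols, m, n, k, w, outer=40, inner=5, lam=1e-6):
    """Alternating minimization for expectile MF on observed entries
    (rows[i], cols[i]) -> b[i]. Each factor update approximately solves the
    asymmetric least squares subproblem by iteratively reweighted least
    squares (a stand-in for the quadratic programming step in the paper)."""
    rng = np.random.default_rng(0)
    X = rng.normal(scale=0.1, size=(m, k))
    Y = rng.normal(scale=0.1, size=(n, k))
    for _ in range(outer):
        # Alternate: update X with Y fixed, then Y with X fixed.
        for fixed, free, idx_fixed, idx_free in ((Y, X, cols, rows),
                                                 (X, Y, rows, cols)):
            for _ in range(inner):
                r = b - np.sum(free[idx_free] * fixed[idx_fixed], axis=1)
                s = np.where(r < 0, 1.0 - w, w)   # sign-dependent weights
                for j in range(free.shape[0]):    # row-wise weighted ridge LS
                    mask = idx_free == j
                    if not mask.any():
                        continue                  # row j never observed
                    A = fixed[idx_fixed[mask]]    # (n_j, k) design matrix
                    sw = s[mask][:, None]
                    G = (A * sw).T @ A + lam * np.eye(k)
                    free[j] = np.linalg.solve(G, (A * sw).T @ b[mask])
    return X, Y

# Demo on noise-free synthetic data, generated as in the previous sketch:
rng = np.random.default_rng(1)
m, n, k, p = 30, 40, 3, 600
M_star = rng.normal(size=(m, k)) @ rng.normal(size=(n, k)).T
rows, cols = rng.integers(0, m, p), rng.integers(0, n, p)
b = M_star[rows, cols]

X, Y = emf_als(b, rows, cols, m, n, k, w=0.9)
err = np.linalg.norm(X @ Y.T - M_star) / np.linalg.norm(M_star)
print(err)  # small when enough entries are observed: without noise, the
            # global optimum recovers M* exactly for any w in (0, 1)
```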
Similar resources
A new approach for building recommender system using non negative matrix factorization method
Nonnegative matrix factorization is a new approach to reduce data dimensions. In this method, by exploiting the nonnegativity of the data matrix, the matrix is decomposed into components that are more interrelated, dividing the data into sections in which the data share a specific relationship. In this paper, we use nonnegative matrix factorization to decompose the user ratin...
Iterative Weighted Non-smooth Non-negative Matrix Factorization for Face Recognition
Non-negative Matrix Factorization (NMF) is a part-based image representation method. It comes from the intuitive idea that an entire face image can be constructed by combining several parts. In this paper, we propose a framework for face recognition by finding localized, part-based representations, denoted "Iterative Weighted Non-smooth Non-negative Matrix Factorization" (IWNS-NMF). A new cost fun...
A Projected Alternating Least square Approach for Computation of Nonnegative Matrix Factorization
Nonnegative matrix factorization (NMF) is a common method in data mining that has been used in different applications as a dimension reduction, classification, or clustering method. Methods in the alternating least squares (ALS) approach are usually used to solve this non-convex minimization problem. At each step of ALS algorithms, two convex least squares problems should be solved, which causes high com...
New Bases for Polynomial-Based Spaces
Since it is well known that the Vandermonde matrix is ill-conditioned, while interpolation itself is not unstable in function space, this paper surveys the choices of other, new bases. These bases are data-dependent and are categorized into discretely l2-orthonormal and continuously L2-orthonormal bases. The first one constructs a unitary Gramian matrix in the space l2(X), while the late...
On the computation of multivariate scenario sets for the skew-t and generalized hyperbolic families
We examine the problem of computing multivariate scenario sets for skewed distributions. Our interest is motivated by the potential use of such sets in the stress testing of insurance companies and banks whose solvency is dependent on changes in a set of financial risk factors. We define multivariate scenario sets based on the notion of half-space depth (HD) and also introduce the notion of ex...
Publication year: 2017